Introduction to Neutrality Statistics and Signs of Selection


Homework for Lab 5: DUE Friday, November 9th


Readings:

In this lab, we we’ll take our first look at neutrality statistics, which are statistical tests designed to measure if whether SNPs appear to be undergoing selection within a population. We’ll go over what each of the statistics are, what they mean, and how to interpret them, and then we’ll use them on our own data (with the exception of our final statistic, iHS, which is for whole sequence data, not variant calls).

First, let’s get an understanding of what each of these statistics might mean when calculated for our populations…

Tajima’s D

This is a fairly common neutrality statistic. For this lab, we’ve read the original paper by Tajima, which details the development of his D statistic as a way of testing for the neutral mutation hypothesis (or Kimura’s theory of neutral mutation, as we’ve discussed it in class). Tajima’s D tests whether the pattern of mutations found in a particular region of the genome in a population is following the assumptions of neutrality models; in other words, if there has been selection for or against a particular allele in the population, the pattern seen will violate the assumptions of the neutral model.

With Tajima’s D, the closer to 0 the D statistic is, the closer the locus being tested is to meeting the assumptions of neutrality. In other words, when D is close to 0 there are no alleles over- or under-represented at that locus across the population, suggesting that selection is not acting on that locus in the population. If the D statistic is a negative number, this means the population is experiencing purifying or negative selection against an allele at that locus. Conversely, if the D statistic is positive, this means that the population is experiencing positive selection at the locus.

Today, we’ll calculate Tajima’s D values for our populations in order to determine if any directional selection is happening at the UCP1 locus.

Fu and Li’s D and F

Fu and Li’s D and F are also fairly straightforward tests derived from the same \(\Theta\) statistics as Tajima’s D. Like Tajima’s D, these statistics are also based upon Kimura’s neutrality statistic, again meaning that anything deviating from 0 is a violation of neutrality. What these statistics do, specifically, is measure the number of singleton or private mutations (i.e., the number of people within a sample that have a novel mutation specific to themselves). While the D statistic is based on the difference between the number of singletons in the population and the total number of mutations in population, the F statistic is based on the difference between the number of singletons and the average number of nucleotide differences between pairs of sequences. This difference in how they’re derived makes it possible for one statistic to be positive while the other is negative in the same population.

We can interpret these statistics like so:

  • If the statistics are strongly negative, this shows an excess of singletons in the population. This means there are quite a few people whose genomic variants don’t match each other. Such a situation could be from rapid population growth (meaning everyone is closely related and so there’s been no time for mutation to catch up to the increased numbers) or a selective sweep (see below for a formal definition) further back in the population’s history, meaning that strong directional selection for a certain mutation led to an overrepresentation of that mutation (at the expense of other variants at that SNP).

  • If the statistics are positive, this indicates an excess of old/ancestral variants, that have been selected for in the past. In other words, there are very few unique variants, and the variants that are present are carried by a lot of individuals.

We will calculate Fu and Li’s D and F statistics for our data today.

Integrated Haplotype Score (iHS)

Recent positive selection usually happens by selective sweeps (i.e. the rapid spread of a particular haplotype, driven by selection for particular alleles within that haplotype), which not only act upon the SNP that corresponds to the phenotype under selection, but the entire haplotype or linkage region in which that SNP is found. When this happens, the alleles in these selected haplotypes become almost entirely homozygous, because selection on this region of the genome is so strong. An iHS score, or an Integrated Haplotype Score, is a test that can be used to measure the amount of recent positive selection experienced by an allele indirectly by looking for evidence of selective sweeps. This test identifies extended haplotypes (large sets of SNPs on a single region of the chromosome) and looking at how many of the SNPs are homozygous in each haplotype. Mass homozygosity in a particular haplotype is recorded as a high iHS score, while low levels of homozygosity is recorded as a low iHS score.

Learning Outcomes

Part 1: Getting Started

Log in to the SCC and bring up the R Studio window as usual:

module load gcc
module load R
rstudio &

Next, we need to install the package PopGenome in our R Studio space, because the SCC doesn’t have it pre-installed.

#We need to install this package
install.packages("PopGenome")

Next, load in the packages we need to use:

library(vcfR)
library(PopGenome)
library(pegas)

Finally, we need to format our data for analysis. PopGenome is a bit funky about how they look for the files in your directory, so we need to move some files around in our folders to make the GENOME object that the PopGenome package will work with. Specifically, if you haven’t already done so, we need to make a new directory within our working directory, and put a copy of our data into that folder. Do this in your SCC space using the mkdir and cp commands (remember to use your population name rather than YRI!):

mkdir YRI
cp YRI.vcf YRI/YRI.vcf

Now, we can make our GENOME object by directing the function to the folder we just created. Additionally, we’ll make a DNAbin object (as we have done before) to save for later…
Here we’ll make our GENOME object:

YRIgenome <- readData("YRI", format = "VCF")
## |            :            |            :            | 100 %
## |====================================================
YRIgenome
## -----
## Modules:
## -----
##                 Calculation             Description         Get.the.Result
## 1                  readData            Reading data           get.sum.data
## 2          neutrality.stats        Neutrality tests         get.neutrality
## 3             linkage.stats  Linkage disequilibrium            get.linkage
## 4              recomb.stats           Recombination             get.recomb
## 5                F_ST.stats          Fixation index get.F_ST,get.diversity
## 6           diversity.stats             Diversities          get.diversity
## 7              sweeps.stats        Selective sweeps             get.sweeps
## 8                       MKT  McDonald-Kreitman test                get.MKT
## 9              detail.stats        Mixed statistics             get.detail
## 10                       MS   Coalescent simulation                      @
## 11           --------------             -----------          -------------
## 12          set.populations Defines the populations                       
## 13 sliding.window.transform          Sliding window                       
## 14           splitting.data         Splits the data                       
## 15               show.slots        ?provided slots?                       
## 16               get.status  Status of calculations

And here we’ll make our DNAbin object:

YRI <- read.vcfR("YRI/YRI.vcf")
## Scanning file to determine attributes.
## File attributes:
##   meta lines: 130
##   header_line: 131
##   variant count: 207
##   column count: 117
## 
Meta line 130 read in.
## All meta lines processed.
## gt matrix initialized.
## Character matrix gt created.
##   Character matrix gt rows: 207
##   Character matrix gt cols: 117
##   skip: 0
##   nrows: 207
##   row_num: 0
## 
Processed variant: 207
## All variants processed
YRIdna <- vcfR2DNAbin(YRI)
## After extracting indels, 192 variants remain.
## 
Variant 3 processed
YRIdna
## 216 DNA sequences in binary format stored in a matrix.
## 
## All sequences of same length: 192 
## 
## Labels:
## NA18486_0
## NA18486_1
## NA18488_0
## NA18488_1
## NA18489_0
## NA18489_1
## ...
## 
## Base composition:
##     a     c     g     t 
## 0.265 0.321 0.275 0.139 
## (Total: 41.47 kb)

Part 2: Fu and Li’s D and F

The function neutrality.stats from PopGenome conveniently calculates several different neutrality statistics for a given population. We’ll use it now to calculate Fu and Li’s D and F:

#Remember to replace the varible name with your own GENOME object!
neut <- neutrality.stats(YRIgenome)
## |            :            |            :            | 100 %
## |====================================================
get.neutrality(neut)
##       neurality stats
## pop 1 Numeric,9
#To see results: 

neut@Fu.Li.F
##         pop 1
## [1,] 0.140358
neut@Fu.Li.D
##          pop 1
## [1,] 0.1403109

As we can see here, the YRI population has D and F statistics that are pretty close to 0. This suggests neither an excess of singleton mutations in this population, nor an excess of ancestral mutations. Put simply, the UCP1 region in the Yoruban population from Ibadan hasn’t evolved much in recent history, which we might expect from an African population with little experience with extreme cold temperatures.

Before we move on, think about what the results for your population means, based on how we interpret Fu and Li’s D. Is there an excess of singletons? Is there an excess of old/ancestral mutations? What does that tell you about your population’s history? We will come back to this at the end of class.

Part 3: Tajima’s D

Now we’ll look at Tajima’s D! When we ran the neutrality.stats function in PopGenome, it also calcuated our Tajima’s D statistic. We can preview our Tajima’s D results from the PopGenome output…

neut@Tajima.D
##           pop 1
## [1,] 0.07245427

But, we can get more useful information by using the tajima.test function in the pegas package. This function will use the DNAbin object that we created earlier as an input. It will not only give us the same D statistic as was calculated by the neutrality.stats function, which is all well and good, but will also give us a P-value associated with our test. This is important, because it will give us a sense of how significant the results are of our neutrality test.

tajima <- tajima.test(YRIdna)
tajima
## $D
## [1] 0.07245427
## 
## $Pval.normal
## [1] 0.9422404
## 
## $Pval.beta
## [1] 0.8974915

What we see from YRI is, again, what we might expect from a population that has been living in a warm climate for many generations: the UCP1 locus in this population does not appear to be subject to selection. This suggests that the only evolution (remember, that’s a change in allele frequencies over generations, and does not have to be due to selection) happening in the population is selectively neutral.

Interpret your own population’s Tajima’s D value. First, compare the D value given by this function and the one given by the neutrality.stats function. Are they the same? What is the D statistic for your population, and what does it mean in terms of whether and what kind of selection might be occuring on your population? We will revisit this question at the end of the lab.

Part 4: An Example of iHS… with Cows



Now, we will run our example of iHS. To run this example, I’m using the rehh package, which is specifically designed to look for Extended Haplotype Homozygosity, or evidence of selective sweeps. We can use this package to look at an example of calculating an iHS score. For a fun digression from human data, the genomic dataset included in the rehh package is from cows! You can use 1000 Genomes Project data with this package, but that requires you to map and phase the data with other genomic analysis-specific programs such as SHAPEIT and Beagle… that’s a bit more than we’re ready to do. Sooo… we’ll just use the cows. If you’d like to learn more about the cow genome, you can read about (and query) the cow genome here. You can also find the cow genome on Ensembl, but it’s unfortunately not yet annotated there (meaning we cannot yet easily make the connection between a chromosomal location and the genes it contains that are similar to those of humans). Even though our example population is a bit more bovine than usual, as we walk through this example, make some predictions about what an iHS score in your population might look like.

The first thing we’ll do is pull the example files from the package and create a haplohh object, which is what the rehh package reads for analyses:

install.packages("rehh")
library(rehh)
#puts example files in our working directory
make.example.files()
#read in files as a haplohh object
hap <- data2haplohh(hap_file="bta12_cgu.hap",map_file="map.inp", recode.allele = TRUE, chr.name = 12)
## Map file seems OK: 1424  SNPs declared for chromosome 12 
## Standard rehh input file assumed
## Alleles are being recoded according to map file as:
##  0 (missing data), 1 (ancestral allele) or 2 (derived allele)
## Discard Haplotype with less than  100 % of genotyped SNPs
## No haplotype discarded
## Discard SNPs genotyped on less than  100 % of haplotypes
## No SNP discarded
## Data consists of 280 haplotypes and 1424 SNPs

Once this is done, we can run the iHS function. The first line of code looks for the haplotypes within the data, and the second calculates the iHS score for every allele:

ehh <- scan_hh(hap,limhaplo=2,limehh=0.05,limehhs=0.05,maxgap=NA,threads=1)
head(ehh)
##          CHR POSITION    freq_A iHH_A iHH_D iES_Tang_et_al_2007
## F1200140  12    79823 0.1500000    NA    NA                  NA
## F1200150  12   125974 0.4071429    NA    NA                  NA
## F1200170  12   175087 0.3571429    NA    NA                  NA
## F1200180  12   219152 0.2214286    NA    NA                  NA
## F1200190  12   256896 0.1750000    NA    NA                  NA
## F1200210  12   316254 0.3892857    NA    NA                  NA
##          iES_Sabeti_et_al_2007
## F1200140                    NA
## F1200150                    NA
## F1200170                    NA
## F1200180                    NA
## F1200190                    NA
## F1200210                    NA
#this will give us a long list of iHS scores... why read the list when we can just plot it?!
ihs <- ihh2ihs(ehh,freqbin=0.025,minmaf=0.05)

Finally, to take in the entirety of our iHS scores, we’ll graph our results! The dotted line represents the threshhold of significance of each iHS score:

ihsplot(ihs, plot.pval=TRUE, ylim.scan=2, pch=16, main="iHS")

What we’re seeing here are two plots representing a map of cow chromosome 12 (which is homologous to human chromosome 13), demarcated on the x-axis by chromosomal positions of SNPs (the ticks on the x-axis are every 20 Mb, or mega-bases, or 20-million-base increments; cow chromosome 12 is a little over 91 Mb long). The y-axis of the first graph is, more or less, the absolute value of the calculated iHS scores; the second is the untransformed value of iHS across the chromosome. The format of these plots is called a Manhattan plot, because the peaks and valleys of the data as you scan across the chromosome look a bit like the skyline of Manhattan.

You can see that that there are several SNPs that are above/below the iHS score significance threshhold, particularly around 30 Mb along chromosome 12. This means that these SNPs exist in the population as part of massively homozygous blocks in this region of the cow genome, which means there was most likely a selective sweep that happened in this population on a phenotype coded by an allele (or multiple alleles) located in this block 30Mb along chromosome 12! Cows have, in the past, apparently adapted to some selective pressure that has acted to maintain homozygosity across the population in this region of their genome. To discover what it is, we’d have to see what genes are in this homozygous haplotype block, but without annotation it’s a bit too difficult for us to manage in this course. If this were the human genome, we would simply go to Ensembl and query this position range and see what gene regions are contained therein. These genes would then be candidate genes for understanding what selective force has been acting on which phenotype.

Now think about what an iHS score might look like in your population. Based on what we’ve discovered today about inferring selection in our populations, do you think that a selective sweep may have happened in the UCP1 region of your population? Is there anything that you know about the environment and your population’s potential adaptations to that environment that you can use to aid in your analysis?

Part 5: What Do Your Results Mean? Discuss with a partner from class:

As we wrap up this lab, think about the results you’ve generated. You have already been prompted to think about what your results might mean as we’ve gone along; now you have the opportunity to talk to a partner about what you think your results might mean.

Pro-Tip: How to run an iHS analysis on 1000 Genomes Project Data

If you really want to run an iHS analysis on 1000 Genomes Project data, you can… you just need to convert the VCF file for a broader genomic region surrounding your gene of interest into a haplotype file. This is relatively simple to do, but it involves using some unfamiliar python and unix code and some data transformations. It also can take a lot of memory, so I’ll ask you to delete files along the way as you try this out. To demonstrate, let’s try running an iHS analysis on a portion of human chromosome 4 to see if we can find evidence of a positive selective sweep around and related to UCP1.

We’ll first need to download a larger chunk of chromosome 4 than we had previously. We could do this on the entire chromosome, as we did for the cow, but this would take up way over our class’ storage allotment on the SCC. Instead, let’s pick a relatively large chunk - about 30 Mb - of chromosome 4 surrounding UCP1.

To cut down on data size, let’s also restrict this dataset to a population where we might expect to see selection around UCP1, perhaps a population that’s been living in the cold, like the Finns!

To do this, we’ll first need to load our modules. This will include tabix to download the data, and vcftools to modify it. We’ll also need to load the programming language python for a few short commands later:

module load tabix
module load vcftools
module load python

Next, we’ll download our 1000 Genomes Project population panel file and create an isolated list of the FIN samples so that we can subset our data file:

#download the population panel file:
wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/integrated_call_samples_v3.20130502.ALL.panel

#get a list of sample IDs for the FIN population:
grep FIN integrated_call_samples_v3.20130502.ALL.panel | cut -f1 > FIN.samples.list

Now, to save on space, we’ll download our extended positions (roughly 15 Mb above and below my gene of interest, UCP1) from chromosome 4 into a zipped file using a familiar tabix command:

tabix -h ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/supporting/GRCh38_positions/ALL.chr4_GRCh38.genotypes.20170504.vcf.gz 4:125000000-155000000 | bgzip -c > CH4_All.vcf.gz

Then, we’ll subset the zipped file for just the FIN data with a familiar vcftools command. While we’re at it, we’ll also filter out all indels and SNPs with a MAF lower than 5%:

vcftools --gzvcf CH4_All.vcf.gz --keep FIN.samples.list --maf 0.05 --remove-indels --recode-INFO AA --recode --stdout | bgzip -c > FIN_CH4.vcf.gz

Running the above code took about 10 minutes, because we’re actually filtering 99 individuals from 2504 while also filtering down 62,551 SNPs!!! That’s a huge amount of data compared to what we’ve been working with so far.

To make sure everything worked appropriately, check out the final file (so far) using the less command (press ‘q’ when you’re done):

gunzip -c FIN_CH4.vcf.gz | less

Given your data looks good, now let’s erase the larger file with all the 1000 Genomes Project individuals, along with other files we’re done with:

rm CH4_All.vcf.gz
rm integrated_call_samples_v3.20130502.ALL.panel
rm FIN.samples.list

Now we’re ready to convert this into a haplotype file and a genome map file for {rehh} using a few relatively unfamiliar steps… these are modified from the Informatics Portal website.

First, we’ll extract our INFO on ancestral alleles:

vcftools --gzvcf FIN_CH4.vcf.gz --get-INFO AA

I have to note that this step is - conceptually - a little complicated because the way the 1000 Genomes Project estimates the ancestral allele at a given position is a bit different from how folks usually think about an ancestral allele (i.e., the allele state of the LCA of the groups you’re testing). You don’t have to worry about it, for now, but you can learn more about it here.

Next, we’ll change lower- to upper-case letters in our INFO file:

cat out.INFO | tr '[:lower:]' '[:upper:]' > out2.INFO

Export the VCF file in IMPUTE format, which is what groups alleles into haplotypes:

vcftools --gzvcf FIN_CH4.vcf.gz --IMPUTE

Delete the first row in our INFO file and replace tabs for spaces:

more +2 out2.INFO | awk '{if($3==$5) print$0,"yes";if($3!=$5) print$0,"no";}' | sed -e 's/\t/ /g' > out3.INFO

Now, there’s a problem here in that IMPUTE is deleting a few loci from the sample. We need to use the ‘join’ command to merge the properly matched columns based on position (column 2 in out3.INFO and column 2 in out.impute.legend), which will automatically then delete non-matches in the output file (both files need to be sorted by the position name for this to work):

join -j 2 -o 1.1,1.2,1.3,1.4,1.5,1.6,1.7,1.8 out3.INFO <(sort -k2 out.impute.legend) > out4.INFO

Merge most recent INFO file and our IMPUTE hap file, then eliminate from the merged file those markers with no ancestral allele info (“.”) or (“N”) or (“-”):

paste out4.INFO out.impute.hap | awk '($5!="." && $5!="N" && $5!="-")' > merged.out

Change the allele codes (0 to 3 and 1 to 4) starting in column 7, and remove potential duplicates:

awk '{for(i=7;i<=NF;i++)if($i==0)$i=3; else $i=4; print}' merged.out | awk '{if(ip[$2]=="") print; ip [$2]=1}' > merged2.out

Prepare {rehh} map file using information from the first 2 columns:

awk '{print $1"-"$2,$1,$2,"1","2"}' merged2.out > mapFIN_CH4.out

#change label 3 to 1 when reference allele = ancestral allele (yes)
#change label 4 to 2 when reference allele = ancestral allele (yes)
#change label 3 to 2 when reference allele = derived allele (no)
#change label 4 to 1 when reference allele = derived allele (no)
#eliminate 6 initial fields in hap file to prepare rehh input hap file

sed -e '/yes/ s/3/1/g' -e '/yes/ s/4/2/g' -e '/no/ s/3/2/g' -e '/no/ s/4/1/g' merged2.out | cut -d " " -f7-  > merged3.out

Transpose to create final haplotype file for {rehh} input:

python -c "import sys; print('\n'.join(' '.join(c) for c in zip(*(l.split() for l in sys.stdin.readlines() if l.strip()))))" < merged3.out > merged4.out

Add first column with numbers in increasing order:

awk '{print NR " "$0}' merged4.out > hapFIN_CH4.out

An finally, remove intermediate files you won’t use anymore:

rm merged.out
rm merged2.out
rm merged3.out
rm merged4.out
rm out.INFO
rm out.impute.hap
rm out.impute.hap.indv
rm out.log
rm out2.INFO
rm out3.INFO
rm out4.INFO

After this whole process, the files mapCH4_FIN.out and hapCH4_FIN.out are the input files for our {rehh} analysis!

Ok… now let’s open Rstudio and try it!

module load gcc
module load R
rstudio &

Once you’re in R, load the {rehh} library (if it throws an error, you may need to install {rehh}; see the main Lab 5 exercise on iHS to see how):

library(rehh)

Then, we need to import our newly made files for the Finns:

hap<-data2haplohh(hap_file="hapFIN_CH4.out",map_file="mapFIN_CH4.out", recode.allele = TRUE, chr.name = 4)

Once those are read in, you should see that our dataset consists of 62510 SNPs split into 198 haplotypes.

Now we can run our extended haplotype homozygosity assessment (this step might take a while, be patient):

ehh<-scan_hh(hap,limhaplo=2,limehh=0.05,limehhs=0.05,maxgap=NA,threads=1)

Retrieve our iHS scores:

ihs <- ihh2ihs(ehh,freqbin=0.025,minmaf=0.05)

And plot them:

ihsplot(ihs, plot.pval=TRUE, ylim.scan=2.5, pch=16, main="iHS")

As you can see, we’ve got a few potential regions where there might be selective sweeps in this part of the genome. There maybe appears to be one or two around the position of UCP1 (~140.5 Mb).

Let’s take a closer look at an actual SNP of interest.

In a recent study in Japan by Nishimura et al. (2017), researchers found that human subjects placed in a cold room for almost two hours showed significant differences in oxygen consumption (metabolism) and increased non-shivering thermogenesis based on having a derived allele (T/C) at a locus in a downstream regulator of UCP1 expression (located −3826 bp from the gene itself) at rs1800592, which is also implicated, in other studies, with obesity and diabetes mellitus.

We can investigate whether the derived version of this SNP is showing greater extended haplotype homozygosity compared to the ancestral version, which would imply a positive selective sweep on that particular SNP. To actually analyze the SNP, we need to check its position on Ensembl (for rs1800592, it’s 140572807). Once we’ve done that, we need to find what number (as in first, second, third, etc) that SNP is in our haplotype file…

This seems complicated, but it’s actually pretty simple. You first need to download your out.impute.legend file to your desktop, and then Import it into Excel as a text file with spaces as the delimiters. Once that’s done, you can insert a new column, Fill it with a Series that increases by one from the first to last SNP, and then you’ve got each SNP marked in numerical order. All you have to do then is Find your SNP name. Once you’ve got the number, enter it into the mrk; in this case, rs1800592 is the 36,102 variant in our file):

res.ehh<-calc_ehh(hap,mrk=36102)

We can see here that there doesn’t seem to be a very clear, extensive difference between the EHH of the ancestral allele vs. the derived allele, but just near the SNP the derived allele has higher EHH, which continues away from the centromere (to the right), suggesting selection for this particular enhancer SNP has occurred in the Finnish population!

The paper also mentioned that there was significant linkage (LD = 1) between rs1800592 and another SNP of interest, rs4956451. Just to be thorough, let’s check out the EHH for this SNP, as well. Remember, if a phenotypic effect is associated with a SNP, but that SNP is in perfect linkage with another, it might actually be the other SNP that’s controlling the phenotype:

res.ehh<-calc_ehh(hap,mrk=36131)

Ok, it really looks like there is extended haplotype homozygosity in the derived allele compared to the ancestral (notice how the red line is consistently higher than the blue around our SNP?) for rs4956451, suggesting positive selection for this SNP in the Finns! In other words, it’s possible that Finns adapted to their cold environment, in part, through an increased derived allele frequency at this locus, which is an enhancer that increases expression of UCP1!

You can also, with {rehh} compare haplotype homozygosity between two different populations, to see if a particular genomic region or SNP has been under differential selection between them (e.g., between Finns and Yorubans). However, that’s getting a bit more advanced than we’ll get for this lab! If you’re interested, you can learn more about these other options by reading through the {rehh} vignette here.

One final exercise we can do, though, is to visualize the haplotype structure around our SNP of interest to give us a better idea of the observed effect of selection. The original use of this is in a paper by Sabeti et al. (2002). This will look like a bifurcating tree, with the thicker branches representing the relative number of individuals maintaining that particular haplotype (thicker = more people in the sample have that haplotype) surrounding the core allele (which is at the vertical dashed line). The nodes of the tree are where the haplotype/linkage is broken by a variant. Let’s check out rs4956451 first:

layout(matrix(1:2,2,1))
bifurcation.diagram(hap,mrk_foc=36131,all_foc=1,nmrk_l=80,nmrk_r=80,
main="Bifurcation diagram (rs4956451, UCP1 enhancer): Ancestral Allele")
bifurcation.diagram(hap,mrk_foc=36131,all_foc=2,nmrk_l=80,nmrk_r=80,
main="Bifurcation diagram (rs4956451, UCP1 enhancer): Derived Allele")

We can see in this figure that the haplotype associated with the derived allele is pretty strongly preserved (this figure goes 80 SNPs to the left and right of our core allele), with fewer breaks in linkage across the genome than the ancestral allele. This further indicates positive selection for the derived allele.

Let’s take a look at our original SNP of interest using this method, rs1800592:

layout(matrix(1:2,2,1))
bifurcation.diagram(hap,mrk_foc=36102,all_foc=1,nmrk_l=80,nmrk_r=80,
main="Bifurcation diagram (rs1800592, UCP1 enhancer): Ancestral Allele")
bifurcation.diagram(hap,mrk_foc=36102,all_foc=2,nmrk_l=80,nmrk_r=80,
main="Bifurcation diagram (rs1800592, UCP1 enhancer): Derived Allele")

That’s more like it! Altough linkage isn’t preserved well in all individuals, particularly on the left (as the we follow the genome towards the centromere), the majority of individuals have preserved the haplotype associated with the derived allele, at least a bit more than is the case with the ancestral. The previous SNP looks more like a clear case of positive selection around the SNP, but this one looks like there has at least been some preservation of the haplotype via selection, as well.